Karyotypic diversification is more prominent in Equus species than in other mammals. Here, using
next-generation sequencing technology, we generated and de novo assembled high-quality genome sequences for a
male wild horse (Przewalski's horse) and a male domestic horse (Mongolian horse), with about 93-fold and
91-fold coverage, respectively. Portions of the Y chromosome from the wild horse (3 Mbp) and
Mongolian horse (2 Mbp) assemblies were also sequenced and de novo assembled. We confirmed a Robertsonian
translocation event through the wild horse's chromosomes 23 and 24, which contained sequences that were
highly homologous with those on the domestic horse's chromosome 5. The four main types of
rearrangement (insertion of unknown origin, inserted duplication, inversion, and relocation) are not evenly
distributed across the chromosomes: some chromosomes, such as the X chromosome, contain more
rearrangements than others, and inversions are far fewer than insertions and
relocations in the horse genome. Furthermore, we discovered that the percentages of LINE_L1 and LTR_ERV1
are significantly increased in rearrangement regions. The analysis of these two representative Equus
genomes improves our knowledge of Equus chromosome rearrangement and karyotype evolution.



Horses are recognized as extremely successful domestic animals. Humans in many parts of the world have
relied on them for thousands of years. The genus Equus originated on the North American continent and
migrated 2.6 million years ago over the Bering Strait during the Ice Age. Horses, donkeys, and zebras
evolved from the same ancestor. The speciation events were accomplished through acute chromosomal
rearrangements, with the rearrangement rate ranging from 2.9 to 22.2 per million years, which is significantly
higher than in other mammals. Equus species possess widely varying diploid chromosome numbers, from 2n = 32
(mountain zebra) to 2n = 66 (Przewalski's horse). Przewalski's horse has a different chromosome number than
domestic horses because of a Robertsonian translocation, with one pair of metacentric chromosomes
(ECA5) corresponding to two pairs of acrocentric chromosomes (EPR23 and EPR24). Although the offspring
produced from a cross between Przewalski's horse and a domestic horse had 65 chromosomes, it was fertile, unlike
the mule (2n = 63, offspring of a male donkey and a female horse) and the hinny (2n = 63, offspring of a male horse
and a female donkey), which are sterile.

Przewalski's horse ("wild horse" hereafter) is the only wild horse species surviving in the world today. Because
of environmental change and human activities, this species dropped to only 12 individuals in the middle of the last
century. Today, the number has increased to approximately 2000, living in the wild or in zoos, but all of them are
descendants of those 12 ancestors. This event dramatically reduced the genetic variation of the wild horse, which
could reduce the ability of the species to adapt to environmental change. Severe genetic bottlenecks have also
occurred in European bison, northern elephant seals and cheetahs. Therefore, the wild horse is not only a
valuable wildlife resource but also a promising model for the study of population genetics. The Mongolian horse is
an ancient horse breed that has been an integral part of the culture of nomadic pastoralists in North Asia. The
Mongolian horse has a large population with abundant genetic diversity. This ancient breed has influenced other
Northern European horse breeds. It has acquired many special abilities and attributes, such as endurance and
disease resistance, and is well adapted to its harsh conditions: a cold, arid climate and poor grazing
opportunities. Dramatic chromosomal rearrangement in the horse is a notable feature in comparison to other
mammals, and this makes the horse an ideal model for studying chromosomal evolution.

SCIENTIFIC REPORTS | 4:4958 | DOI: 10.1038/srep04958

In this study, we obtained high-quality whole-genome sequences of a
male wild horse and a male Mongolian horse using next-generation
sequencing technology. The genome sequences of these two representative
Equus species should improve the genomic maps of the horse.
Building on them, we focus on karyotypic diversification and use
comparative genomics to explore the genetic mechanisms and patterns
of chromosomal evolution in Equus species.

Results 

Genome sequencing and assembly. The wild horse and Mongolian
horse genomes were sequenced using the Illumina HiSeq platform. A
paired-end library (500 bp) and two mate-pair libraries (3 kb and
8 kb) were constructed for both the wild horse and the Mongolian horse.
In total, we generated 231.21 Gb and 224.17 Gb of usable sequence
for the wild horse and Mongolian horse, respectively. The sequence depth was
93× and 91×, respectively (Supplementary Table S1). The
sequencing error rates were 0.000575 and 0.000507 for the wild
horse and Mongolian horse, respectively (Supplementary Table
S2). After assembly, the wild horse and Mongolian horse genome
sequences had the same total length (2.38 Gb) (Supplementary
Table S3).

We checked 248 core eukaryotic genes in our two assemblies, and
found that the completeness was comparable with that of published
genome assemblies (Supplementary Table S4). We did
not detect any misassemblies when comparing our genome
assemblies with the wild horse and Mongolian horse sequences available in
GenBank (Supplementary Table S5).

We assembled the Y chromosome of the wild horse and Mongolian horse
(Fig. 1). Previous studies reported 127 markers on the horse Y chromosome;
in this study, 87 and 103 of these markers could
be detected in the wild horse and Mongolian horse assemblies,
respectively (Supplementary Tables S6 and S7). Thus, 34 scaffolds (3,018,288 bp)
of the wild horse and 48 scaffolds (1,971,029 bp) of the Mongolian horse
were identified as originating from the Y chromosome. The length of the
collinear regions between the wild horse and Mongolian horse was
around 1.74 Mbp.

To improve gene prediction accuracy, eight types of tissue samples
(heart, liver, spleen, lung, kidney, brain, spinal cord and muscle)
from a female Mongolian horse were used to construct cDNA libraries.
RNA-seq was performed using the 454 FLX+ platform,
and 853,978 reads were obtained with an average length of 458 bp
(Supplementary Fig. S1). From these transcriptome data, aided by
homology-based gene prediction methods, we estimated that the
horse genome contains 20,000 to 21,000 protein-coding genes.

Figure 1 | Scaffolds of the Y chromosome of the wild horse and Mongolian horse. Thirty-four scaffolds of the wild horse and 48 scaffolds of the Mongolian horse are
shown in this figure, and collinear regions are linked. Numbers outside the brackets are the scaffold IDs of the wild horse (carmine) and
Mongolian horse (green). Numbers inside the brackets give the count of markers detected in each scaffold.

Figure 2 | Synteny analysis. Microsynteny between chromosome 5 of
domesticated horses (ECA5) and chromosomes 23 and 24 of wild horses
(EPR23, EPR24). Locally Collinear Blocks (LCBs) are marked with the
same color and connected by straight lines. The probes (LAMC2, LAMB3,
VCAM1, UOX, DIA1), which are used for FISH, are also detected in this
figure.

Synteny analysis. Robertsonian translocation, which is also called
whole-arm translocation or centric-fusion translocation, is a
common form of chromosomal rearrangement. Previous studies
based on fluorescence in situ hybridization (FISH) indicate
that chromosomes 23 and 24 of the wild horse are homologous with
chromosome 5 of the domestic horse. After assembling the wild
horse and Mongolian horse genomes, we masked out all repetitive
sequences and found that EPR23 and EPR24 of the wild horse and ECA5
of the Mongolian horse could be aligned to chromosome 5 of the
reference genome. The five probes (LAMC2, LAMB3, VCAM1,
UOX and DIA1), which were used in FISH mapping in previous
research to confirm that EPR23 and EPR24 are homologous with ECA5,
were also identified in both the wild horse and domestic horse
genomes (Fig. 2).

To study the relationship between Robertsonian translocation and
local rearrangement, we performed whole-genome synteny analysis.
We compared the wild horse genome and the Mongolian horse genome
separately with the Thoroughbred horse genome. The collinear region
between the Mongolian horse and Thoroughbred horse (2.25 Gbp) was
slightly longer than that between the wild horse and Thoroughbred
horse (2.23 Gbp). A total of 124 Mbp (5.51%) of the wild horse genome and
76 Mbp (3.34%) of the Mongolian horse genome could not be aligned to the
Thoroughbred horse genome. Four types of rearrangement, BRK
(insertion of unknown origin), DUP (inserted duplication), INV
(inversion), and JMP (relocation), were identified (Supplementary
Tables S8 and S9).

Since artifactual mis-joins in the assemblies could be counted as
rearrangements, we attempted to estimate the proportion of rearrangement
breakpoints that were correctly assembled (the correct rate). We remapped
the usable reads to the genome assemblies of the wild horse and Mongolian
horse, respectively. Then we checked the number of reads mapped at the
breakpoint of each type of rearrangement. If the number was less than three, we
considered the assembly incorrect (Supplementary Fig. S2);
otherwise we considered it correct (Supplementary Fig. S3). We examined 100
breakpoints for each type of rearrangement and calculated the correct rate.
The correct rates of INV (92%) and JMP (82%) were higher than
those of BRK (76%) and DUP (58%) in the wild horse assembly. In the
Mongolian horse, the correct rates were similar to those of the wild
horse (Supplementary Table S10).
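
The breakpoint check described above can be illustrated with a minimal sketch. It assumes that the number of reads mapped across each breakpoint has already been counted; the function names and the example counts are hypothetical and are not taken from the study.

```python
def is_supported(read_count, min_reads=3):
    """A breakpoint is treated as correctly assembled when at least
    `min_reads` reads map across it (fewer than three means incorrect)."""
    return read_count >= min_reads


def correct_rate(read_counts, min_reads=3):
    """Fraction of sampled breakpoints that pass the read-support check."""
    if not read_counts:
        raise ValueError("no breakpoints were provided")
    supported = sum(1 for n in read_counts if is_supported(n, min_reads))
    return supported / len(read_counts)


# Hypothetical read counts for five breakpoints of one rearrangement type.
example_counts = [0, 2, 3, 15, 40]
print(correct_rate(example_counts))  # 3 of 5 breakpoints pass
```

In the study, 100 breakpoints per rearrangement type were examined, so the input list would hold 100 counts for each of BRK, DUP, INV and JMP.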

The potential rearrangement sites were investigated for potential
synapomorphies. We counted the rearrangement events in two situations:
(1) assuming that the genome sequences of the Thoroughbred horse and
Mongolian horse represent the consensus, we identified rearrangements in the
wild horse; (2) assuming that the genome sequences of the Thoroughbred horse
and wild horse represent the consensus, we identified rearrangements in the
Mongolian horse. We found that rearrangement events were far more
numerous in the first situation than in the second (Supplementary
Fig. S4). This result is consistent with the phylogeny.

Figure 3 | Local rearrangements in the wild horse and Mongolian horse.
Chromosome 5 of the domestic horse has undergone Robertsonian
translocation (marked in yellow). The Thoroughbred horse genome was used
as the reference, so the chromosome that has undergone Robertsonian
translocation is also shown as chromosome 5 for the wild horse in this figure. BRK:
insertion of unknown origin; DUP: inserted duplication; INV: inversion;
JMP: relocation.

The numbers of rearrangements on each chromosome were
counted (Supplementary Table S11). Chromosome 5 does not have
a greater number of local rearrangements than the other
chromosomes, although it has undergone Robertsonian
translocation (Fig. 3). We noticed that inversions are
far fewer than insertions and relocations in the horse
genome. Some chromosomes, including the X chromosome, contain
more local rearrangements than others. Local rearrangements are
more numerous in the genome of the wild horse than in that of the
Mongolian horse.

Figure 4 | Analysis of repetitive sequences. (a) The proportions of repetitive sequences among six species of mammals. Seven common repetitive
sequences are marked in red, and the subclasses are marked in black. (b) The content of repetitive sequences is significantly increased in rearrangement
regions compared with the collinear region. The p-value is shown at the top. (c) Some repetitive sequences with content greater than 0.5%
of the genome. The content of repetitive sequences is significantly increased in BRK/DUP/INV/JMP regions compared with the collinear region.
'*' p-value < 0.05.



Repetitive sequences. Repetitive sequences comprise approximately
50% of mammalian genomes and are associated with syntenic
breakpoints and chromosomal fragility. Repetitive sequences of
six species of mammals (horse, human, mouse, dog, cattle
and pig) were examined in this study (Fig. 4a). Seven common
repetitive sequences were identified: short interspersed repeated
sequences (SINE), long interspersed repeated sequences (LINE),
long terminal repeats (LTR), DNA elements, satellites, simple
repeats and low complexity. Broadly, the analysis of these
sequences indicated that 41.4% of the horse genome consists of
repetitive sequences, which is comparable to the percentages in
humans (46.8%), mouse (42.5%), dogs (40.0%), cattle (47.1%), and
pigs (39.1%). LINEs comprise 22.6% of the horse genome, which is
more than in humans (19.7%), mouse (19.1%), dogs (19.8%), cattle
(21.9%), and pigs (18.4%). SINEs make up 7.3% of the horse
genome, less than in the human (13.4%), mouse (7.5%), dog (10.6%),
cattle (17.0%), and pig (13.0%) genomes (Supplementary Fig. S5).

The distribution of repetitive sequences in each chromosome was
also examined. The results indicated that each chromosome contains
a similar proportion of repetitive sequences, except the X
chromosome, which contains a higher proportion of repetitive sequences
than autosomes in the six species.

Using those rearrangement regions of the wild horse genome,
we studied the association between rearrangement and repetitive
sequences. We found that some types of repetitive sequences were
significantly increased in the rearrangement regions (Fig. 4b,
Supplementary Table S12). This result is consistent with previous
findings. Interestingly, the proportions of LINE_L1 and
LTR_ERV1 increased, but the proportions of LINE_L2 and several other
repetitive sequences decreased (Fig. 4c, Supplementary Tables S13 to
S16). This result suggests that LINE_L1 and LTR_ERV1 may play a
more important role in chromosome rearrangement.

Heterozygosity analysis. We identified 1,280,203 and 2,203,945
heterozygous SNPs (within an individual) in the genomes of the
wild horse and Mongolian horse, respectively (Supplementary Tables S17 to
S19). Small indels were also identified in the genomes of the wild
horse and Mongolian horse (Supplementary Table S20). The
heterozygosity rates were 0.52 × 10⁻³ and 0.89 × 10⁻³ in the wild
horse and Mongolian horse, respectively. The heterozygosity of the
wild horse is considerably lower than that of the Mongolian horse.

SNPs were not evenly distributed among the wild horse chromosomes
but were evenly distributed among the Mongolian horse chromosomes
(Fig. 5a). We explored the heterozygosity rates of different
regions using sliding windows of 50 kb with a step size of 10 kb. The
number of sliding windows with a high heterozygosity rate in the
Mongolian horse genome was considerably greater than that in the wild
horse genome (Supplementary Fig. S6). Another interesting
phenomenon in the genome of the wild horse was that heterozygous
SNPs were entirely absent from many large regions (Fig. 5b).
The sequence coverage of those regions in the wild horse was the
same as in the Mongolian horse (Fig. 5c, Supplementary Fig. S7).

In the wild horse, homozygous regions (regions with no SNPs in the
wild horse but more than 0.8 SNPs/kbp in the Mongolian horse) had a
total length of 1,287 Mbp, and 58 homozygous regions were
larger than 1 Mbp (Supplementary Table S21). A total of 4508 genes
were located in those homozygous regions. Enrichment analysis
indicated that these genes were enriched for specific functional
categories: olfactory transduction (n = 118, p-value = 8.90e-12),
regulation of cell proliferation (n = 11, p-value = 1.70e-02), calcium
ion binding (n = 9, p-value = 1.70e-02) and others.
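
The homozygous-region criterion stated above (no SNPs in the wild horse, more than 0.8 SNPs/kbp in the Mongolian horse) can be written as a small predicate. This is only a sketch: the function name, the per-region SNP counts and the region length used in the example are hypothetical, and the text does not specify how region boundaries were drawn.

```python
def is_homozygous_region(wild_snps, mongolian_snps, length_bp,
                         min_mongolian_density=0.8):
    """Apply the stated criterion to one region.

    wild_snps / mongolian_snps: heterozygous SNP counts in the region.
    length_bp: region length in base pairs (must be positive).
    min_mongolian_density: threshold in SNPs per kbp for the Mongolian horse.
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    mongolian_per_kbp = mongolian_snps / (length_bp / 1000)
    return wild_snps == 0 and mongolian_per_kbp > min_mongolian_density


# Hypothetical 50 kb region: 0 SNPs in the wild horse, 50 in the Mongolian horse.
print(is_homozygous_region(0, 50, 50_000))
```

Requiring a minimum SNP density in the Mongolian horse guards against calling a region homozygous merely because it is invariant in both animals.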

Discussion 

In the past decades, researchers have studied chromosomal 
rearrangement using different conventional methods such as chro- 
mosome banding and FISH. However, many local rearrangements 
are extremely difficult to detect. Here, we sequenced and de novo 
assembled the homologous chromosomes that had undergone 
Robertsonian translocation. Our study indicated that Robertsonian 
translocation did not increase local rearrangements. These findings 
indicated that Robertsonian translocation and local rearrangements 
may be caused by different mechanisms. In our results, inversions
are rarer than insertions and relocations, suggesting that insertions
and relocations may play a more important role in shaping the
genome.

Some studies have demonstrated that repetitive sequences are
associated with syntenic breakpoints and chromosomal fragility.
This study did not reveal significant differences in repetitive
sequences among different species and different chromosomes
(except the X chromosome). Different genome sequencing strategies
(clone by clone and whole-genome shotgun) may affect the
measured content of repetitive sequences in the genomes. Our results
suggest that local chromosomal rearrangements are highly
associated with repetitive sequences. However, these repetitive sequences
did not contribute equally to rearrangement. LINE_L1 and
LTR_ERV1 may play a more important role than other repetitive
sequences.



In the middle of the last century, the population of the wild horse
dropped to only 12 individuals. The genetic bottleneck and
inbreeding caused by this event may explain why the wild horse genome
contains many more homozygous regions. One interesting result
is that heterozygous SNPs are entirely absent from
chromosome 26 of the wild horse. It was the largest fragment without
heterozygous SNPs. As the sequencing coverage of EPR26 is similar to that of
other autosomes, we confirmed that there was a pair of chromosome
26 in this individual (Supplementary Fig. S8). Another explanation
could be that this pair of chromosome 26 arose through
uniparental isodisomy. We also sequenced a short region
(~700 bp) in chromosome 26 of several other wild horse samples
using the Sanger method and found 6 SNPs, indicating that this
region is heterozygous in some other wild horses.

The analysis of these two representative Equus genomes
improved the genomic maps of the horse. It also revealed unique
aspects of chromosomal rearrangement and improved our
understanding of chromosomal evolution in mammals, indicating that
Equus is a promising model for exploring karyotypic
instability. These analyses and discoveries should benefit studies of mammalian
karyotypic evolution and chromosomal rearrangement, as well as studies
of human diseases caused by chromosome aberrations.

Methods 

Sampling and genome sequencing. Protocols used for this experiment were
consistent with those approved by the Institutional Animal Care and Use Committee
at Inner Mongolia Agricultural University. For sequencing, a male wild horse was
selected from the "YE MA International Group" of Xinjiang, China, and a Mongolian
horse was selected from the Xilingol League of Inner Mongolia, China. DNA was
extracted from ear tissue and peripheral blood cells. An Illumina HiSeq 2000 was used to
sequence the genomes of the wild horse and Mongolian horse using a shotgun strategy. A
paired-end library (500 bp; a standard genomic library sequenced using paired-end
reads) and two mate-pair libraries (3 kb and 8 kb) were constructed for each horse.
The read length was 101 bp for the paired-end library and both mate-pair libraries.
Library preparation and sequencing followed the manufacturer's instructions, and
sequence reads were collected from the Illumina data processing pipeline.

Data filtering. The following types of reads were filtered out: (1) reads with more
than 3 unidentified nucleotides, (2) reads with average phred quality below Q30, and
(3) reads with unidentified nucleotides in the first 50 nucleotides.
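
The three filtering rules above can be sketched as a single predicate. The sketch assumes that unidentified nucleotides are encoded as "N" and that quality strings use the Phred+33 ASCII offset; neither detail is stated in the text, and the function names are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Average Phred score of a FASTQ quality string (offset is an assumption)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def passes_filter(seq, quality, max_n=3, min_mean_q=30,
                  n_free_prefix=50, offset=33):
    """Return True when a read survives all three filtering rules."""
    if not seq or len(seq) != len(quality):
        return False  # malformed record; treated as failing (assumption)
    seq = seq.upper()
    # Rule 1: more than `max_n` unidentified nucleotides.
    if seq.count("N") > max_n:
        return False
    # Rule 2: average Phred quality below Q30.
    if mean_phred(quality, offset) < min_mean_q:
        return False
    # Rule 3: any unidentified nucleotide in the first 50 positions.
    if "N" in seq[:n_free_prefix]:
        return False
    return True


# Hypothetical 101 bp read with uniform Q40 qualities ("I" under Phred+33).
read = "ACGT" * 25 + "A"
print(passes_filter(read, "I" * 101))
```

With an older Phred+64 encoding, the same function can be called with `offset=64`.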

Genome assembly. The genome sequences of the wild horse and Mongolian horse
were assembled from short reads using SOAPdenovo. We first assembled the short
reads of the paired-end library (500 bp) into contigs using sequence overlap
information. Then, we used the information from the mate-pair libraries (3 kb and 8 kb)
to join the contigs into scaffolds. Finally, "Gapcloser" (http://soap.genomics.org.cn/
soapdenovo.html) was used to close the gaps inside the scaffolds.

Genome annotation. For protein-coding gene annotation, ab initio prediction was
performed with MAKER. We generated cDNA data from multiple RNA sources.
Using an oligo(dT)-based approach, cDNA libraries were constructed from eight types
of tissue samples (heart, liver, spleen, lung, kidney, brain, spinal cord and muscle)
from a female Mongolian horse. The libraries were sequenced using a Roche 454 FLX+
platform.

Estimated sequencing error rate. Data from the X chromosome, which is hemizygous
in males, were used to estimate the sequencing error rate, so the calculations were not
influenced by heterozygous SNPs. All qualified reads from the wild horse and
Mongolian horse were mapped to the X chromosome of Equus caballus, and all
repetitive and low complexity regions were excluded. At each nucleotide position, the
predominant call was assumed to be true, and all others were considered to be errors.

Synteny analysis. We used the MAUVE program to construct the synteny map for
chromosome 5 of domestic horses (Thoroughbred and Mongolian horse) and
chromosomes 23 and 24 of the wild horse. We masked out all repetitive sequences,
and unique sequences were preserved. Then, we used the Mauve Contig Mover
(MCM) to order the draft genomes of the wild horse and Mongolian horse relative to
the Thoroughbred horse genome (Equcab 2.0). The synteny analysis used
progressiveMAUVE.

We used MUMmer to perform the synteny analysis for the whole genomes of the
wild horse and Mongolian horse against the reference genome. Four types of
rearrangements (BRK, DUP, INV, JMP) were identified using the "nucmer" module.
The parameters were "-c 800 -g 300 -l 100".

Figure 5 | Effect of genetic bottleneck on genome landscape. (a) The SNP distribution of each chromosome in the wild horse and Mongolian horse. For
the Mongolian horse, the SNP distribution of each autosome is similar, but for the wild horse, the SNP distribution among the autosomes is different,
and there are no SNPs on EPR26 (ECA25 in this figure). (b) Contrast of heterozygous SNPs between the wild horse and Mongolian horse. (c) The
sequencing depths of chromosome 21 and 25.

Repetitive sequence analysis. We screened DNA sequences for interspersed repeats
and low complexity DNA sequences using RepeatMasker (http://www.repeatmasker.org/)
in the collinear and rearrangement regions. Collinear regions
larger than 100 kb were used for the following analysis, and rearrangement loci plus 2 kb
extended flanking regions were treated as rearrangement regions. The t-test was
performed using R software.
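
The two region definitions above can be sketched as follows. The sketch assumes 0-based, half-open coordinates and that the 2 kb flank is added on each side of a locus and clipped at the chromosome ends; these conventions and the function names are assumptions, not details given in the text.

```python
def rearrangement_region(start, end, chrom_length, flank=2000):
    """Extend a rearrangement locus by `flank` bp on each side,
    clipped to the chromosome boundaries."""
    return max(0, start - flank), min(chrom_length, end + flank)


def long_collinear_regions(regions, min_length=100_000):
    """Keep only collinear regions strictly longer than `min_length` bp."""
    return [(s, e) for s, e in regions if e - s > min_length]


# Hypothetical locus at 5,000-6,000 bp on a 1 Mbp chromosome.
print(rearrangement_region(5_000, 6_000, 1_000_000))
# Hypothetical collinear regions; only the second exceeds 100 kb.
print(long_collinear_regions([(0, 80_000), (200_000, 350_000)]))
```

The repeat content of the two region sets would then be compared with a t-test, as stated in the text.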

SNP calling and heterozygosity rate estimation. We utilized the BWA program to
map the usable reads from the paired-end libraries (500 bp) of the wild horse and
Mongolian horse to the genome sequence of the Thoroughbred horse (Equcab 2.0). The
parameters chosen for mapping were as follows: seed length of 32, and the maximum
occurrences for extending a long deletion of 10. Duplicated reads were removed with
SAMtools. SNPs and InDels were called using the Genome Analysis Toolkit
according to its published guidelines.

The heterozygosity rate was estimated as the density of heterozygous SNPs for the 
whole genome. For the estimation of local heterozygosity rate, sliding windows of 
50 kb with 80% overlap between adjacent windows were used to scan the genome. 
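
The local estimate described above can be sketched as follows: a 50 kb window moved in 10 kb steps gives the stated 80% overlap between adjacent windows. The sketch assumes 0-based positions of heterozygous SNPs on one chromosome; the function name and the example positions are hypothetical.

```python
from bisect import bisect_left


def window_heterozygosity(snp_positions, chrom_length,
                          window=50_000, step=10_000):
    """Return (start, end, rate) for each sliding window, where rate is the
    number of heterozygous SNPs in [start, end) divided by the window length."""
    positions = sorted(snp_positions)
    results = []
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        # Count SNPs with start <= position < end using binary search.
        n = bisect_left(positions, end) - bisect_left(positions, start)
        results.append((start, end, n / (end - start)))
        if end == chrom_length:
            break  # the last window reaches the chromosome end
        start += step
    return results


# Hypothetical chromosome of 70 kb with three heterozygous SNPs.
for start, end, rate in window_heterozygosity([5, 15_000, 55_000], 70_000):
    print(start, end, rate)
```

The whole-genome rate is the same calculation applied to a single window spanning the genome, that is, the total number of heterozygous SNPs divided by the genome length.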